Eatomics is an R-Shiny based web application for interactive exploration of quantitative proteomics data generated by the MaxQuant software. It gives researchers with limited bioinformatics knowledge fast access to differential expression and pathway analysis. The application supports quality control, visualization, differential expression analysis and pathway analysis of quantitative proteomics data. Highlights are an extensive experimental setup module, the data and report generation feature, and the multiple ways to interact with and customize the analysis.

1. Input files

Eatomics requires two file inputs:

  1. Demo_proteinGroups.txt: The proteinGroups.txt file (a tab-separated text file) as generated by MaxQuant, the software for quantitative analysis of raw mass spectrometry data. The file should contain at least the columns Protein IDs, Majority protein IDs, Gene names, the LFQ/iBAQ measurement columns, Reverse, Potential contaminant, and Only identified by site. The latter three may be empty.

  2. Demo_clinicaldata.txt: The sample description file, a tab-separated text file that can be produced with any office program by saving the spreadsheet as .txt. The file needs to contain a column named “PatientID” whose IDs match the sample IDs from the proteinGroups header (without the “LFQ intensity” or “iBAQ” prefixes), and one or more named columns with “parameters”, i.e. textual/factual/logical or continuous/integer values. Column names have to be unique.

The demo data can be loaded directly via the upload button if you are testing on our public server. For a local installation you may use your own data or the demo files in Eatomics/Data from the GitHub repository.
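
Before uploading, it can help to check that the two files agree on sample IDs. The following is a minimal sketch of such a check, not part of Eatomics itself: the function names, the miniature inline tables and the exact prefix strings (including the trailing space) are assumptions made for illustration.

```python
import csv
import io

# Hypothetical miniature inputs; real files come from MaxQuant and a spreadsheet export.
PROTEIN_GROUPS = (
    "Protein IDs\tGene names\tLFQ intensity S1\tLFQ intensity S2\tReverse\n"
    "P12345\tGENE1\t1000\t0\t\n"
)
CLINICAL = "PatientID\tgroup\tage\nS1\tcase\t54\nS3\tcontrol\t61\n"

# Assumed spelling of the measurement-column prefixes named in the text.
PREFIXES = ("LFQ intensity ", "iBAQ ")


def sample_ids_from_header(header):
    """Return sample IDs found in measurement columns, with the prefix removed."""
    ids = []
    for name in header:
        for prefix in PREFIXES:
            if name.startswith(prefix):
                ids.append(name[len(prefix):])
                break
    return ids


def check_inputs(protein_text, clinical_text):
    """Compare sample IDs in the proteinGroups header with the PatientID column."""
    header = next(csv.reader(io.StringIO(protein_text), delimiter="\t"))
    rows = list(csv.DictReader(io.StringIO(clinical_text), delimiter="\t"))
    measured = set(sample_ids_from_header(header))
    described = {row["PatientID"] for row in rows}
    return {
        "matched": sorted(measured & described),
        "missing_description": sorted(measured - described),
        "missing_measurement": sorted(described - measured),
    }


report = check_inputs(PROTEIN_GROUPS, CLINICAL)
print(report)
```

In this toy example only S1 appears in both files; S2 has measurements but no description, and S3 is described but was never measured.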

Examples of input files needed for Eatomics are an evidence file as produced by the MaxQuant algorithm (left) and a sample description file which may contain as many parameters as available.

2. Application walk-through

Eatomics functionality is structured into four tab panels. Each tab consists of a side panel for configuring the analysis and a main panel for interactive visualization of the results.

Step 1: Load and Prepare

The first tab provides an overview of the data quality and enables filtering and preparation of the data for differential expression and enrichment analysis.

Configuration panel

Within the side panel the user can load data and configure quality control options.

Load proteinGroups.txt input file

To begin the analysis, the user has to upload the MaxQuant file (e.g. proteinGroups.txt), as specified above. Once the file is fully uploaded, rows that were only found in the reverse database, that belong to potential contaminants, or that were only identified by site are filtered out automatically.
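
The automatic filter can be pictured with the following sketch. It assumes the usual MaxQuant convention that a flagged row carries a "+" in the respective column; the function name and the three-row table are hypothetical, and the actual Eatomics code may differ.

```python
import csv
import io

# The three flag columns listed in the input file description.
FLAG_COLUMNS = ("Reverse", "Potential contaminant", "Only identified by site")

# Hypothetical three-row excerpt of a proteinGroups table.
TABLE = (
    "Protein IDs\tReverse\tPotential contaminant\tOnly identified by site\n"
    "P1\t\t\t\n"
    "REV__P2\t+\t\t\n"
    "CON__P3\t\t+\t\n"
)


def drop_flagged(rows):
    """Keep rows in which none of the flag columns carries a '+' mark (assumed convention)."""
    return [
        row
        for row in rows
        if not any((row.get(col) or "").strip() == "+" for col in FLAG_COLUMNS)
    ]


rows = list(csv.DictReader(io.StringIO(TABLE), delimiter="\t"))
kept = drop_flagged(rows)
print([row["Protein IDs"] for row in kept])
```

Empty flag columns are treated as "not flagged", which matches the statement above that these three columns may be empty.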

Quality control and data cleansing

Load the sample description/clinical data file

Select and load the clinical data input file (e.g. clinicaldata.txt), as specified above.

Configuration panel to load input data and to prepare the data set for analysis.

Visualization panel

In the main panel (right) interactive visualizations are shown.

Principal component analysis

A common method of dimensionality reduction is principal component analysis (PCA). PCA calculates the axes of greatest variation (principal components) within the expression data. A common assumption is that a plot along these axes will separate the samples/patients into the groups under investigation. The user can choose which principal components to visualize and can color the samples by any of the uploaded sample/clinical characteristics.
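
To make the idea of an "axis of greatest variation" concrete, here is a small sketch that finds the first principal component by power iteration on centered data. This is an illustrative stand-in in pure Python, not the routine Eatomics uses; the function names and the four-sample data set are hypothetical.

```python
import math


def center_columns(matrix):
    """Subtract each column (protein) mean so that variation is measured around zero."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def first_component(matrix, iterations=200):
    """Approximate the first principal component by power iteration on X^T X."""
    centered = center_columns(matrix)
    n_features = len(centered[0])
    loading = [1.0 / math.sqrt(n_features)] * n_features
    for _ in range(iterations):
        scores = [sum(v * w for v, w in zip(row, loading)) for row in centered]
        # Multiply by X^T to obtain the next, unnormalized loading vector.
        nxt = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(n_features)]
        norm = math.sqrt(sum(v * v for v in nxt))
        loading = [v / norm for v in nxt]
    scores = [sum(v * w for v, w in zip(row, loading)) for row in centered]
    return loading, scores


# Hypothetical log2 intensities: rows are samples, columns are proteins.
data = [
    [10.0, 5.0, 7.0],   # case 1
    [11.0, 5.2, 7.1],   # case 2
    [14.0, 5.1, 6.9],   # control 1
    [15.0, 4.9, 7.0],   # control 2
]
loading, scores = first_component(data)
print(loading, scores)
```

In this toy data the first protein carries almost all of the variation, so the scores of the two cases fall on one side of zero and those of the two controls on the other, which is the kind of separation the PCA plot is meant to reveal.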

Distribution overview

The distribution overview gives an impression of the sample-wise distribution of all measured intensities.

Protein coverage

Protein coverage describes the count of distinct protein groups per sample.
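
As a rough sketch of this count, the snippet below tallies, per sample, the protein groups with a non-zero intensity. Treating an intensity of 0 as "not detected" is an assumption, and the function name and data are hypothetical.

```python
# Hypothetical intensities: one dict per protein group, keyed by sample ID.
intensities = [
    {"S1": 1200.0, "S2": 0.0, "S3": 950.0},
    {"S1": 0.0, "S2": 0.0, "S3": 430.0},
    {"S1": 80.0, "S2": 75.0, "S3": 0.0},
]


def protein_coverage(rows):
    """Count, for every sample, the protein groups with an intensity above zero."""
    counts = {}
    for row in rows:
        for sample, value in row.items():
            # Assumption: a value of 0 means the protein group was not quantified.
            counts[sample] = counts.get(sample, 0) + (1 if value > 0 else 0)
    return counts


coverage = protein_coverage(intensities)
print(coverage)
```

A sample with a markedly lower count than the others, such as S2 here, would stand out in the coverage plot.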

Sample to sample heatmap

The sample-to-sample heatmap describes the biological and technical variability of the samples. The user can choose Euclidean distance or Pearson correlation as the (dis)similarity metric. The clusters that form should resemble the sample groups under investigation.
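
The two metrics answer slightly different questions, which the following sketch illustrates on hypothetical data. The helper names and the three sample profiles are assumptions; the heatmap itself is not reproduced here.

```python
import math


def euclidean(a, b):
    """Euclidean distance: sensitive to absolute differences in intensity."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation: compares the shape of two profiles, ignoring offset and scale."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical log2 intensity profiles (one value per protein) for three samples.
samples = {
    "S1": [20.0, 22.0, 25.0],
    "S2": [21.0, 23.0, 26.0],  # same profile as S1, shifted by one unit
    "S3": [25.0, 22.0, 20.0],  # reversed profile
}


def pairwise(metric):
    """Evaluate a metric for every ordered pair of samples."""
    names = sorted(samples)
    return {(p, q): metric(samples[p], samples[q]) for p in names for q in names}


distances = pairwise(euclidean)
correlations = pairwise(pearson)
print(distances[("S1", "S2")], correlations[("S1", "S2")], correlations[("S1", "S3")])
```

S1 and S2 differ only by a constant offset, so their Euclidean distance is non-zero while their Pearson correlation is 1. Which metric is more appropriate depends on whether such global intensity shifts should count as dissimilarity.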

Cumulative Protein Intensities

Protein intensities are cumulated across all samples and plotted according to their relative abundance. Coloring marks the quartile each protein falls into. Highly abundant proteins, i.e., proteins ranked in the first quartile, are colored red and labeled. The top 20 ranked proteins and their cumulated intensities are given in the table to the right.
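
One way to read this plot is as a ranking with a running total. The sketch below sums each protein's intensity over all samples, ranks the proteins, and reports the cumulative share of the total signal. Assigning quartiles by rank position is an assumption, since the text does not say how Eatomics defines them, and the gene names and values are purely illustrative.

```python
# Hypothetical input: gene name -> intensities across samples.
intensities = {
    "ALB": [700.0, 500.0],
    "APOA1": [200.0, 200.0],
    "TF": [150.0, 150.0],
    "C3": [60.0, 40.0],
}


def cumulative_ranking(table):
    """Rank proteins by summed intensity and track the cumulative fraction of total signal."""
    totals = {gene: sum(values) for gene, values in table.items()}
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    result, running = [], 0.0
    for rank, (gene, total) in enumerate(ranked, start=1):
        running += total
        # Assumption: quartile by rank position, 1 = most abundant quarter.
        quartile = (4 * (rank - 1)) // len(ranked) + 1
        result.append((rank, gene, total, running / grand_total, quartile))
    return result


ranking = cumulative_ranking(intensities)
for entry in ranking:
    print(entry)
```

In this toy example the top-ranked protein alone accounts for 60% of the summed intensity, the kind of dominance by a few abundant proteins that the plot is designed to expose.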

Visualization of protein abundance in a PCA.

Sample-wise distribution overview of protein abundance data.

Sample-wise coverage of protein abundance data.
Sample to sample heatmap.
Cumulative protein intensities of all samples.

Step 2: Differential expression

In step 2, the user can translate a hypothesis about the data into an experimental design and test it. Eatomics uses limma to perform real-time analysis of differentially expressed proteins with respect to the clinical parameters of choice. The resulting interactive visualizations, including a volcano plot (detailed below), give a quick and detailed overview of the differential expression. limma (linear models for microarray data) is a commonly used R/Bioconductor software package for analyzing microarray and RNA-seq data. It fits a linear model, which can be parametrized in the elaborate experimental design module of Eatomics.
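
To give a feel for what "fitting a linear model" means for a single protein, the sketch below uses plain ordinary least squares. This is deliberately not limma: limma additionally moderates the per-protein variances with an empirical Bayes step, which is omitted here. All names and numbers are hypothetical.

```python
import math


def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x for one protein."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Categorical design: 1 = first selected group, 0 = second group (hypothetical data).
group = [1, 1, 1, 0, 0, 0]
log2_intensity = [math.log2(v) for v in (800.0, 1600.0, 3200.0, 400.0, 800.0, 1600.0)]
log2_fc, baseline = ols_slope_intercept(group, log2_intensity)

# Continuous design: the same fit with age as the explanatory variable (hypothetical data).
age = [40.0, 50.0, 60.0, 70.0]
log2_by_age = [12.0, 10.0, 8.0, 6.0]
slope_per_year, _ = ols_slope_intercept(age, log2_by_age)

print(log2_fc, baseline, slope_per_year, 2 ** slope_per_year)
```

With a 0/1 group indicator the slope equals the difference between the group means of the log2 intensities, i.e. the log2 fold change. With a continuous variable the slope is the change in log2 intensity per unit, here per year of age.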

Experimental design configuration

Experimental design module with two categorical variables.

Experimental design module with a continuous variable.

Visualization of differentially expressed proteins

The result of differential expression analysis is displayed in an interactive volcano plot, two tables of up- and downregulated proteins and box and scatter plots of actual protein abundance.

Volcano plot

The volcano plot shows the log2 fold change on the x-axis and the negative log10 of the Benjamini-Hochberg adjusted p-value on the y-axis. Significant results are shown in yellow. The thresholds on log2 fold change and adjusted p-value that define significance can be set by the user directly within the threshold box. Hovering over a dot in the volcano plot displays the respective gene name. In the case of a categorical response, a positive fold change means that the protein is more abundant in the first selected group than in the second. When a continuous response is modeled, the fold change has to be interpreted as the slope, i.e., the increase (positive log2 fold change) or decrease (negative log2 fold change) of the protein abundance per one-unit change of the response variable. For example, if age is analyzed, a log2 fold change of -0.2 would mean that LFQ intensity, and thus protein abundance, decreases by a factor of about 1.15 (2^0.2), i.e. by roughly 13%, with every year of age.
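
The coordinates of the volcano plot can be reproduced with a short sketch: Benjamini-Hochberg adjustment of the raw p-values, followed by the two threshold checks. The function names, the default cutoffs (0.05 and 1.0) and the four example results are assumptions for illustration; in Eatomics the thresholds are whatever the user enters in the threshold box.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def volcano_points(results, max_adj_p=0.05, min_abs_log2_fc=1.0):
    """Compute plot coordinates and a significance flag (cutoffs are hypothetical defaults)."""
    adjusted = benjamini_hochberg([p for _, _, p in results])
    points = []
    for (gene, log2_fc, _), adj_p in zip(results, adjusted):
        significant = adj_p <= max_adj_p and abs(log2_fc) >= min_abs_log2_fc
        points.append(
            {"gene": gene, "x": log2_fc, "y": -math.log10(adj_p), "significant": significant}
        )
    return points


# Hypothetical results: (gene name, log2 fold change, raw p-value).
results = [
    ("GENE_A", 2.1, 0.001),
    ("GENE_B", -1.4, 0.012),
    ("GENE_C", 0.3, 0.030),
    ("GENE_D", 1.2, 0.400),
]
points = volcano_points(results)
for point in points:
    print(point)
```

GENE_C passes the adjusted p-value cutoff but not the fold change cutoff, and GENE_D the other way round, so neither would be highlighted under these example thresholds. A dot counts as significant only when both criteria are met.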

Result tables and box/scatter plots

Significant results are listed in two tables below the volcano plot. They show the actual logFC, p-value and the adjusted p-value. A click on a protein entry in the data table produces a box (in the case of a categorical response variable) or scatter plot (in the case of a continuous response variable) showing the actual abundance values of selected proteins with regard to the tested comparisons.
The color of individual dots can be chosen to reflect a parameter from the sample description file.

Volcano plot of differential expression analysis and the threshold box for user-defined adjusted p-value and log2 fold change cutoff.

Box plot of selected proteins as a result of a categorical response variable. Color of the points can be selected by the user from the sample description file.

Scatter plot with a linear fit through data points of a selected protein as a result of a continuous response variable. Color of the points can be selected by the user from the sample description file as in the box plot.
